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Abstract. Following a recent surge in using history-based methods for resolving 
perceptual aliasing in reinforcement learning, we introduce an algorithm based 
on the feature reinforcement learning framework called $MDP 1 14 1. To create a 
practical algorithm we devise a stochastic search procedure for a class of context 
trees based on parallel tempering and a specialized proposal distribution. We pro- 
vide the first empirical evaluation for <PMDP. Our proposed algorithm achieves 
superior performance to the classical U-tree algorithm |21 1 and the recent active- 
LZ algorithm (6), and is competitive with MC-AIXI-CTW (28) that maintains 
a bayesian mixture over all context trees up to a chosen depth. We are encour- 
aged by our ability to compete with this sophisticated method using an algorithm 
that simply picks one single model, and uses Q-leaming on the corresponding 
MDP. Our <PMDP algorithm is much simpler, yet consumes less time and mem- 
ory. These results show promise for our future work on attacking more complex 
and larger problems. 



1 Introduction 

Reinforcement Learning (RL) l27ll aims to learn how to succeed in a task through trial 
and error. This active research area is well developed for environments that are Markov 
Decision Processes (MDPs); however, real world environments are often partially ob- 
servable and non-Markovian. The recently introduced Feature Markov Decision Process 
(<PMDP) framework [14] attempts to reduce actual RL tasks to MDPs for the purpose 
of attacking the general RL problem where the environment's model as well as the set 
of states are unknown. In 11261 . Sunehag and Hutter take a step further in the theoretical 
investigation of Feature Reinforcement Learning by proving consistency results. In this 
article, we develop an actual Feature Reinforcement Learning algorithm and empiri- 
cally analyze its performance in a number of environments. 

One of the most useful classes of maps (^s) that can be used to summarize histories 
as states of an MDP, is the class of context trees. Our stochastic search procedure, the 
principal component of our <PMDP algorithm GS<£A, works on a subset of all context 
trees, called Markov trees. Markov trees have previously been studied in [22 1 but under 
names like FSMX sources or FSM closed tree sources. The stochastic search procedure 
employed for our empirical investigation utilizes a parallel tempering methodology 171 . 
|[T2ll together with a specialized proposal distribution. In the experimental section, the 
performance of the ^MDP algorithm where stochastic search is conducted over the 
space of context-tree maps is shown and compared with three other related context 
tree-based methods. 
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Our <PMDP algorithm is briefly summarized as follows. First, perform a certain 
number of random actions, then use this history to find a high-quality map by minimiz- 
ing a cost function that evaluates the quality of each map. The quality here refers to 
the ability to predict rewards using the created states. We perform a search procedure 
for uncovering high-quality maps followed by executing Q-learning on the MDP whose 
states are induced by the detected optimal map. The current history is then updated with 
the additional experiences obtained from the interactions with the environment through 
Q-Learning. After that, we may repeat the procedure but without the random actions. 
The repetition refines the current "optimal" map, as longer histories provide more use- 
ful information for map evaluation. The ultimate optimal policy of the algorithm is 
retrieved from the action values Q on the resulting MDP induced from the final optimal 
map. 

Contributions. Our contributions are: extending the original <£MDP cost function pre- 
sented in [14 1 to allow for more discriminative learning and more efficient minimiza- 
tion (through stochastic search) of the cost; identifying the Markov action-observation 
context trees as an important class of feature maps for <PMDP; proposing the GS^A 
algorithm where several chosen learning and search procedures are logically combined; 
providing the first empirical analysis of the <PMDP model; and designing a specialized 
proposal distribution for stochastic search over the space of Markov trees, which is of 
critical importance for finding the best possible ^MDP agent. 

Related Work. Our algorithm is a history-based method. This means that we are utiliz- 
ing memory that in principle can be long, but in most of this article and in the related 
works is near term. Given a history h t of observations, actions and rewards we define 
states St = <P{h t ) based on some map <P. The main class of maps that we will consider 
are based on context trees. The classical algorithm of this sort is U-tree ED . which 
uses a local criterion based on a statistical test for splitting nodes in a context tree; 
while <PMDP employs a global cost function. Because of this advantage, <PMDP can 
potentially be used in conjunction with any optimization methods to find the optimal 
model. 

There has been a recent surge of interest in history based methods with the intro- 
duction of the active-LZ algorithm [6|, which generalizes the widely used Lempel-Ziv 
compression scheme to the reinforcement learning setting and assumes n-Markov mod- 
els of environments; and MC-AIXI-CTW ESI , which uses a Bayesian mixture of con- 
text trees and incorporates both the Context Tree Weighting algorithm [31 1 as well as 
UCT Monte Carlo planning [ 16 1. These can all be viewed as attempts at resolving per- 
ceptual aliasing problems with the help of short-term memory. This has turned out to be 
a more tractable approach than Baum- Welch methods for learning a Partially Observ- 
able Markov Decision Process (POMDP) [4| or Predictive State Representations [24|. 
The history based methods attempt to directly learn the environment states, thereby 
avoiding the POMDP-learning problem lfT31 . Il20l which is extremely hard to solve. 
Model minimization [8| is a line of works that also seek for a minimal representation 
of the state space, but focus on solving Markovian problems while ^MDP and other 
aforementioned history-based methods target non-Markovian ones. It is also worthy to 
note that there are various other attempts to find compact representations of MDP state 
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spaces Ifflfl : most of which, unlike our approach, address the planning problem where 
the MDP model is given 

Paper Organization. The paper is organized as follows. Section [2] introduces pre- 
liminaries on Reinforcement Learning, Markov Decision Processes, Stochastic Search 
methods and Context Trees. These are the components from which the <PMDP algo- 
rithm (GS^A) is built. In Section[3]we put all of the components into our ^MDP algo- 
rithm and also describe our specialized search proposal distribution in detail. Section|4] 
presents experimental results on four domains. Finally Section summarizes the main 
results of this paper, and briefly suggests possible research directions. 

2 Preliminaries 

2.1 Markov Decision Processes (MDP) 

An environment is a process which at any discrete time t, given action at&A produces 
an observation o t E(D and a corresponding reward r t GR. When the process is a Markov 
Decision Process ll27l ; Ot represents the environment state, and hence is denoted by 
St instead. Formally, a finite MDP is denoted by a quadruple (S,A,T,TZ) in which S 
is a finite set of states; A is a finite set of actions; T = (T s a s , : s,s' G S, a G A) is a 
collection of transition probabilities of the next state s t+ i = s' given the current state 
St = s and action a t = a; and R = (Rg S i : s,s' G S, a G A) is a reward function R® s , = 
E[?' t+ i |s t = s,a t — a,s t +i = s']. The return at time step t is the total discounted reward 
R t — rt+i+jr t+ 2+j 2 r t+ 3 + ..., where 7 is the geometric discount factor (0 <j< 1). 

Similarly, the action value in state s following policy it is defined as Q 7r (s,a) = 
Ejr [Rt \s t = s,a t = a] =E 7r [^^ 7 fe r t+/ ! C+ i \s t = s,a t = a]. For a known MDP, a useful 
way to find an estimate of the optimal action values Q* is to employ the Action- Value 
Iteration (AVI) algorithm, which is based on the optimal action-value Bellman equation 
1271 . and iterates the update Q(s,a)-^J2 s ' T ss' l R ss' +7max a /<3(s',a')]. 

If the MDP model is unknown, an effective estimation technique is provided by 
Q-learning, which incrementally updates estimates Q t through the equation 

Q(st,at)<-Q(st,at)+at(st,at)err t 

where the feedback error err t —r t +i+ , yma.x a Q(st+i,a)~Q(st,at), and at(st,at) is 
the learning rate at time t. Under the assumption of sufficient visits of all state-action 
pairs, Q-Learning converges if and only if some conditions of the learning rates are 
met 12, ll27l . In practice a small constant value of the learning rates (a(st,at) = rf) 
is, however, often adequate to get a good estimate of Q*. Q-Learning is off -policy; it 
directly approximates Q* regardless of what actions are actually taken. This approach 
is particularly beneficial when handling the exploration-exploitation tradeoff in RL. 

It is well known that learning by taking greedy actions retrieved from the current 
estimate Q of Q* to explore the state-action space generally leads to suboptimal behav- 
ior. The simplest remedy for this inefficiency is to employ the e-greedy scheme, where 
with probability e > we take a random action, and with probability 1 — e the greedy 
action is selected. This method is simple, but has shown to fail to properly resolve the 
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exploration-exploitation tradeoff. A more systematic strategy for exploring the unseen 
scenarios, instead of just taking random actions, is to use optimistic initial values [27], 
(3). To apply this idea to Q-Learning, we simply initialize Q(s,a) with large values. 
Suppose i? max is the maximal reward, Q initializations of at least ~^'"^ x are optimistic 

asQ{s,a)<f^. 

2.2 Feature Reinforcement Learning 

Problem description. An RL agent aims to find the optimal policy it for taking action 
at given the history of past observations, rewards and actions h t = o\r\a\...ot-\rt-\a,t-\OtTt 
in order to maximize the long-term reward signal. If the problem satisfies an MDP; as 
can be seen above, efficient solutions are available. We aim to attack the most challeng- 
ing RL problem where the environment's states and model are both unknown. In [ 13 1, 
this problem is named the Universal Artificial Intelligence (AI) problem since almost 
all AI problems can be reduced to it. 

#MDP framework. In [14], Hutter proposes a history -based method, a general statis- 
tical and information theoretic framework called ^MDP This approach offers a critical 
preliminary reduction step to facilitate the agent's ultimate search for the optimal pol- 
icy. The general ^MDP framework endeavors to extract relevant features for reward 
prediction from the past history h t by using a feature map <P: H^S, where T~L is the 
set of all finite histories. More specifically, we want the states St = <&(ht) and the re- 
sulting tuple {S, A, TV) to satisfy the Markov property of an MDP. As aforementioned, 
one of the most useful classes of <Ps is the class of context trees, where each tree maps 
a history to a single state represented by the tree itself. A more general class of <P is 
Probabilistic-Deterministic Finite Automata (PDFA) (29), which map histories to the 
MDP states where the next state can be determined from the current state and the next 
observation. The primary purpose of <PMDP is to find a map <P so that rewards of the 
MDP induced from the map can be predicted well. This enables us to use MDP solvers, 
like AVI and Q-learning, on the induced MDP to find a good policy. The reduction qual- 
ity of each <P is dictated by the capability of predicting rewards of the resulting MDP 
induced from that <P. A suitable cost function that measures the utility of ^s for this 
purpose is essential, and the optimal <P is the one that minimizes this cost function. 

Cost function. The cost used in this paper is an extended version of the original cost 
introduced in [ 14 1. We define a cost that measures the reward predictability of each <P, 
or more specifically of the resulting MDP induced from that <P. Based on this, our cost 
includes the description length of rewards; however, rewards depend on states as well, 
so the description length of states must be also added to the cost. In other words, the 
cost comprises coding of the rewards and resulting states, and is defined as follows: 

Cost Q (^|/i„):=aCL(si :n |ai :n ) + (l-a)CL(ri : „|si : „,ai : „) 

where si :n = Si,...,s n and a\- n = a%,...,a n and s t = ^(ht) and h t — ora\; t -\r t and 
< a < 1. For coding we use the two-part code [30|, [ 10 1, hence the code length (CL) 
is CL(x) = CL(x|6*)+CL(0) where x denotes the data sampled from the model speci- 
fied by parameters 9. We employ the optimal codes [5 1 for describing data CL(x\Q) = 
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\og(l/ Prg(x)), while parameters are uniformly encoded to precision 1/ \fl{x) where 
£(x) is the sequence length of x iflOll : CL(0) = ,!J1 ^-\ogl{x), here m is the num- 
ber of parameters. The optimal <P is found via the optimization problem (p°'P Umal — 
argmin^, Cost Q (<P | h„ ) . 

Denote n. := [rti n^... n{\ (I is determined in specific context); n + '=Ylj n j ( n j s 
are components of vector n.); | • | cardinality of a set; n*J, := \{t : (s t ,a t ,s t+ i,r t +i) = 
(s,a,s',r'), 1 <t <n}\; and H(p) = — Y^i=iP^°SPi Shannon entropy of a random vari- 
able with distribution p = [pi p2- ■ ■ pi] where Y^\=iPi = 1- The state and reward cost 
functions can, then, be analytically computed as follows: 

CX(« 1:n |a 1: „) = £CL«+) = $>#H fe) + ^log<+ 

s 7 a s,a \ s + / 

CL(r 1:n |* 1:n ,a l!n )= 2 CL«;,)= ^ „-+ H ^) +&^log<+ 

As we primarily want to find a that has the best reward predictability, the in- 
troduction of a is primarily to stress on reward coding, making costs for high-quality 
<Ps much lower with very small a values. In other words, a amplifies the differences 
among high-quality ^s and bad ones; and this accelerates our stochastic search process 
described below. 

We furthermore replace CL(x) with CLp(x) = CL(x|0)+/3CL(0) in Cost a to de- 
fine Cost Qi( 3 for the purpose of being able to select the right model given limited data. 
The motivation to introduce j3 is the following. For stationary environments the cost 
function is analytically of this form C\ x u(a) x 0(n) + C2 x v(a) x t((3) x 0(log(n)) 
where G\ ,0% are constants, and u,v,t are linear functions. The optimal <P should be the 
one with the smallest value of C\ xu(a), however, the curse here is that in practice 
C2 x v(a) is often big, so in order to obtain the optimal <P with limited data, a small 
value of (3 will help. We assert that with a very large number of samples n, a and f3 
can be ignored in the above cost function (use a = 0.5, /3 = 1 as the cost in [ 14 1). The 
choice of small a and (3 helps us more quickly to overcome the model penalty and find 
the optimal map. This strategy is a quite common practice in statistics, and even in the 
Minimum Description Length (MDL) community ITOl . For instance, AIC (T) uses a 
very small (3 = 2 /logn. 

The interested reader is referred to [14| for more detailed analytical formulas, and 
ll26l for further motivation and consistency proofs of the <PMDP model. 



2.3 Context Trees 

The class of maps that we will base our algorithm on is a class of context trees. 

Observation Context Tree (OCT). OCT is a class of maps <P used to extract relevant 
information from histories that include only past observations, not actions and rewards. 
The presentation of OCT is mainly to facilitate the definitions of the below Action- 
Observation Context Tree. 



6 Phuong Nguyen, Peter Sunehag, and Marcus Hutter 



Definition. Given an |0|-ary alphabet = {o 1 ,o 2 ,...,o^ }, an OCT constructed from 
the alphabet O is defined as a |0|-ary tree in which edges coming from any internal 
node are labeled by letters in O from left to right in the order given. 

Given an OCT T constructed from the alphabet 0, the state suffix set, or briefly 
state set S = {s 1 ,s 2 ,...,s m } C 0* induced from T is defined as the set of all possible 
strings of edge labels forming along a path from a leaf node to the root node of T. T 
is called a Markov tree if it has the so-called Markov property for its associated state 
set, that is, for every s % 6 S and o k G 0, s l o k has a unique suffix s J G S. The state set of 
a Markov OCT is called Markov state set. OCTs that do not have the Markov property 
are identified as non-Markov OCTs. Non-Markov state sets are similarly defined. 

Example. Figure [T(a)| A) and |l(a)[ B) respectively represent two binary OCTs of depths 
two and three; also Figures fl(b)[ A) and |l(b)[ B) illustrate two ternary OCTs of depths 
two and three. 




(a) Binary context trees (b) Trinary context trees 

Fig. 1. Context Trees 



As can be seen from Figure Q] trees |l(a)| A) and |l(b)[ A) are Markov; on the other 
hand, trees |l(aX B) and |l(b)[ B) are non-Markov. The state set of tree[T(ajA) is «S (a)(A) = 
{00,01,01,11}; and furthermore with any further observation o e O and s € S^ a ^ A \ 
there exists a unique s' ES which is a suffix of so. Hence, tree |l(a)[ A) is Markov. Table 
1(a) represents the deterministic relation between s, o and s'. 



(a) Markov property of <S (a)(A) 



(b) Non-markov property of 5^ a ^ s ' 



s 


00 01 10 11 


00 01 10 11 s 


001 101 11 


001 101 11 








1 





1 


s' 


00 10 00 10 


01 11 01 11 s' 





101 orOOl 11 11 11 



Table 1. Markov and Non-Markov properties 



However, there is no such relation in tree |l(aX B), or state set sOH 5 ) = {0,001,101,11}; 
for s = and o=l, it is ambiguous whether s' = 101 or 001. Table 1(b) clarifies the non- 
Markov property of tree |l(a)[ B). 

Similar arguments can be applied for trees |l(b)[ A) and |l(b)[ B) to identify their 
Markov property. 
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It is also worthy to illustrate how an OCT can be used as a map. We illustrate 
the mapping using again the OCTs in Figure Q] Given two histories including only 
past observations h 5 = 11101 and h' 6 = 211210, then ^ a ^ A \h 5 ) =0l,$( a K B \h 5 ) = 
101,$(W A >(fe£) = 10, and <2>W( B ) (h' 6 ) = 2W. 

Action-Observation Context Tree (AOCT). AOCTs are extended from the OCTs pre- 
sented above for the generic RL problem where relevant histories contain both actions 
and observations. 

Definition. Given two alphabets, = {o 1 ,o 2 ,...,o^ } named observation set, and A = 
{a 1 ,a 2 ,...,a'' 4 l} named action set, an AOCT constructed from the two alphabets is de- 
fined as a tree where any internal node at even depths has branching factor |0|, and 
edges coming from such nodes are labeled by letters in from left to right in the or- 
der given; and similarly any internal node at odd depths has branching factor \A\, and 
edges coming from these nodes are labeled by letters in A also from left to right in the 
specified order. 

The definitions of Markov and non-Markov AOCTs are similar to those of OCTs 
except that a next observation is now replaced by the next action and observation. For- 
mally, suppose T is an AOCT constructed from the above two alphabets; and S = 
{s 1 ,s 2 ,...,s m } C (Ax0)*UAx(Ax0)* is the state suffix set of the tree, then T is 
defined as a Markov AOCT if it has the Markov property, that is, for every 1 < i < m, 
1 < j < \A\, and 1 < k < |0| there exist a unique 1 < I < m such that s is a suffix 
of s l a^o k . AOCTs that do not have Markov property are categorized as non-Markov 
AOCTs. 

The total number of AOCTs up to a certain depth d, K(d), can be recursively com- 
puted via the formula K{d+2) = {[K(d)] ^ + 1}I I + 1 where K(0) = l,K(l) = 2. As 
can be easily seen from the recursive formula, the total number of AOCTs is doubly 
exponential in the tree depth. 

An important point to note here is that in our four experiments presented in Section 
2] the <P space is limited to Markov AOCTs, since as explained above, the state suffix set 
induced from a non-Markov AOCT does not represent an MDP state set; to put it more 
clearly, in non-Markov AOCTs, from the next action and observation, we cannot derive 
the next state from the current one. The Markov constraint on AOCTs significantly 
reduces the search space for our stochastic search algorithm. In the U-tree algorithm 
1211 . no distinction of Marov and non-Markov trees is identified; the algorithm attempts 
to search for the optimal tree over the whole space of AOCTs. 

2.4 Stochastic search 

While we have defined the cost criterion for evaluating maps, the problem of finding 
the optimal map remains. When the <P space is huge, e.g. context-tree map space where 
the number of ^>s grows doubly exponentially with the tree depth, exhaustive search is 
unable to deal with domains where the optimal <P is non-trivial. Stochastic search is a 
powerful tool for solving optimization problems where the landscape of the objective 
function is complex, and it appears impossible to analytically or numerically find the ex- 
act or even approximate global optimal solution. A typical stochastic search algorithm 
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starts with a predefined or arbitrary configuration (initial argument of the objective func- 
tion or state of a system), and from this generates a sequence of configurations based on 
some predefined probabilistic criterion; the configuration with the best objective value 
will be retained. There are a wide range of stochastic search methods proposed in the 
literature [23|; the most popular among these are simulated-annealing-type algorithms 
Ifl9ll , Il25ll . An essential element of a simulated-annealing (SA) algorithm is a Markov 
Chain Monte Carlo (MCMC) sampling scheme where a proposed new configuration 
y is drawn from a proposal distribution q(y\y), and we then change from configura- 
tion y to y with probability min{l, ^|~j^~j^ } where -kt is a target distribution. In a 
simulated-annealing (S A) algorithm where the traditional Metropolis-Hasting sampling 
scheme is utilized, ttt is proportional to e~^ x ^/ T if / is an objective function that we 
want to minimize, and T is some positive constant temperature. ^[-j^] is called the 
correction factor; it is there to compensate for bias in q. 

The traditional SA uses an MCMC scheme with some temperature-decreasing strat- 
egy. Although shown to be able to find the global optimum asymptotically J9), it gen- 
erally works badly in practice as we do not know which temperature cooling scheme is 
appropriate for the problem under consideration. Fortunately in the <PMDP cost function 
we know typical cost differences between two <Ps (C/3 x log(n)), so the range of appro- 
priate temperatures can be significantly reduced. The search process may be improved 
if we run a number of SA procedures with various different temperatures. Parallel Tem- 
pering (PT) Q, fl2l . an interesting variant of the traditional SA, significantly improves 
this stochastic search process by smartly offering a swapping step, letting the search 
procedure use small temperatures for exploitation and big ones for exploration. 

Parallel tempering. PT performs stochastic search over the product space X\ x ... x 
Xi(Xi = X VI <i <I), where X is the objective function's domain, and / is the parallel 
factor. Fixed temperatures Ti (i = 1,... and 1 < T\ < T2 < ... < Tj) are chosen for 
spaces Xi (i = l,... ,1). Temperatures Ti (i = l,...,I) are selected based on the following 
formula (y- — j^)\AH\ w — \ogp a where AH is the "typical" difference between 
function values of two successive configurations; and p a is the lower bound for the 
swapping acceptance rate. The main steps of each PT loop are as follows: 

- (xi ,Xj) is the current sampling; draw u~ Uniform[0,l] 

- If u<ceo, update every x\ tox l - t+1 ' 1 via some Markov Chain Monte Carlo (MCMC) 
scheme like Metropolis-Hasting (Parallel step) 

- If u > ao, randomly choose a neighbor pair, say i and i+ 1, and accept the swap of 

xf^ andx-^ with probability min{l, lrT '| I (y^ T ' +1 ^ 1 ' ) | } (Swapping step). 



The full details of PT are given in AlgorithmQ] 
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Algorithm 1 Parallel Tempering (PT) 



Require: An objective function h(x) to be minimized, or equivalently the target distribution 

tyc a e~ h( - x ^ c for some positive constant C 
Require: Swap probability parameter Qo 
Require: A proposal distribution q(y\x) 

Require: Temperatures Ti ,T2 ,. . . ,Tl, and number of iterations N 
1: Initialize arbitrary configurations x ( - 1,1 \...,x^ L ' 1 \ {a;' fe,l) : represents the i th value of x for 
temperature T k ;}) 
x opt 4- argmin a:=a .(.,i) h(x) 
for i = 1 to TV do 
for k = 1 to L do 

Sample y from the proposal distribution q(y\y) 
r^-min{l, ^ffjg|g } (Metropolis Hastings) 
Draw u~ Uniform[0,l] and update 
if u<r(y,y) then 

a; (fe ' l) 
else 

x^'^i-y 
end if 

ifh(z opt )>/i(a; (M) )then 

end if 
end for 

Draw Uniform[0,l] 
if u > qo then 

Draw a Uniform {1,...,L — 1} and let 6 = a+l 

'T,(' |t ' i| )'T t (- (, ' i »). 



21 

22 
23 
24 
25 
26 
27 



rf-min l, — , (a 8) . — b . (b t) . } 

»T I (l l °''')fT,,(l l '' , ' , ) J 

Draw w~ Uniform[0,l] 



if w < r then 

Swap a; (a ' l) and x^ 
end if 
end if 
end for 



Return x opt 



If its swapping phase is excluded, PT is simply the combination of a fixed number 
of Metropolis-Hastings procedures. The central point that makes PT powerful is its 
swapping step where adjacent temperatures interchange their sampling regions. This 
means that a good configuration can be allowed to use a cooler temperature and exploit 
what it has found while a worse configuration is given a higher temperature which 
results in more exploration. 
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3 The <£MDP Algorithm 

We now describe how the generic ^MDP algorithm works. The general algorithm is 
shown below (Algorithm |2). It first takes a number of random actions (5000 in all our 
experiments). Then it defines the cost function Cost a ^ based on this history. Stochastic 
search is then used to find a map <P with low cost. Based on the optimal <P the history is 
transformed into a sequence of states, actions and rewards. We use optimistic frequency 
estimates from this history to estimate probability parameters for state transitions and 
rewards. More precisely, we use ~ Rmax+, ' 1 l 4 ?" +r "' instead of the average ri+ '" +r '" to 

* m+l & tci 

estimate expected reward, where r\,...,r m are the rewards that have been observed for 
a certain state-action pair, and i? max is the highest possible reward. The statistics are 
used to estimate Q values using AVI. After this the agent starts to interact with the 
environment again using Q-learning initialized with the values that resulted from the 
performed AVI. The switch from AVI to Q-Learning is rather obvious, as Q-Learning 
only needs one cheap update per time step, while AVI requires updating the whole 
environment model and running a number of value iterations. The first set of random 
actions might not be sufficient to characterize what the best maps <P look like, so it 
might be beneficial to add the new history gathered by the Q-Learning interactions with 
the environment to the old history, and then repeat the process but without the initial 
sampling. 



Algorithm 2 Generic Stochastic <PMDP Agent (GS0A) 



Require: Environment; initials 'ampleN 'umber ; agentLearning Loops, 

stochasticlterations and additional S ampleN umber 
1: Generate a history h lnltlal of length initials ampleN umber 
2: h<-h mltlal 
3: repeat 

4: Run the chosen stochastic search scheme for the history h to find a $ with low cost 
5: Compute MDP statistics (optimistic frequency estimates R and T) induced from $ 
6: Apply AVI to find the optimal Q* values using the computed statistics R and T, 
7: Interact with environment for additional S ampleN umber iterations of Q-Learning us- 
ing Q* as initial values; the obtained additional history is stored in j l addlUonal 
8: h^[h,h additional ] 

9: agentLearningLoops <— agentLearningLoops — 1 
10: until agentLearningLoops = 

11: Compute the optimal policy n °P Umal f r0 m the optimal <t> and Q values 
Return [<p°P tima \ ^opumai^ 



In the first four experiments in Section|4] PT is employed to search over the <P space 
of Markov AOCTs. 



3.1 Proposal Distribution for Stochastic Search over the Markov-AOCT Space 

The principal optional component of the above high-level algorithm, GS^A, is a stochas- 
tic search procedure of which some algorithms have been presented in Section [2~4l In 



Feature Reinforcement Learning in Practice 



11 



these algorithms, an essential technical detail is the proposal distribution q. It is natu- 
ral to generate the next tree (the next proposal or configuration) from the current tree 
by splitting or merging nodes. It is possible to express the exact form of our proposal 
distribution, and based on this to explain how the next tree (next configuration) is pro- 
posed from the current tree (current configuration). However, the analytical form of the 
distribution is cumbersome to specify, so for better exposition we opt to describe the 
exact behavior of the tree proposal distribution instead. 

The stochastic search procedure starts with a Markov AOCT where all of the tree 
nodes are mergeable, and splittable. However, in the course of the search, a tree node 
might become unmergeable, but not the other way round; and a splittable node might 
turn to be unsplittable and vice versa. These specific transfering scenarios are described 
as follows. A mergeable tree node of the current tree becomes unmergeable if the cur- 
rent tree is proposed from the previous tree by splitting that node, and the cost of the 
current tree is smaller than that of the previous tree. A splittable leaf node of the cur- 
rent tree becomes unsplittable if the state associated with that node is not present in the 
current history; however, an unsplittable leaf node might revert to splittable when the 
state associated with that node is present in the future updated history. The constraint 
on merging is to keep good short-term memory for predicting rewards, while the other 
on splitting is simply following the Occam's razor principle. 

Merge and split permits. Given some current tree at a particular point in time of the 
stochastic search process, when considering the generation of the next tree proposal, 
most of the tree nodes, though labeled splittable and/or mergeable, might have no split, 
or merge permit, or neither. A node has split permit if it is a leaf node with splittable 
label. When a leaf node has been split, we simply add all possible children for this 
node, and label the edges according to the definition of AOCTs. As mentioned above, 
the newly added leaf nodes might be labeled unmergeable if the cost of the new tree is 
smaller than that of the old one; and these nodes might also be labeled unsplittable if 
the states associated with the new leaf nodes are not present in the current history. A 
node has merge permit if it is labeled mergeable, and all of its children are leaf nodes. 
When a tree node is merged, all the edges and nodes associated with its children are 
removed. 

Markov-merge and Markov-split permits. Since our search space is the class of 
Markov OACTs, whenever a split or merge occurs, extra adjustments might be needed 
to make the new tree Markov. After a split, there might be nodes that make the tree vi- 
olate the Markov assumption, and therefore, need to be split. After we split all of those 
we have to check again to see if any other nodes now need to be split. This goes on until 
we have a Markov AOCT again. The same applies to merging. 

When a node is Markov-split, it and all of the leaf nodes that need to be split (in- 
cluding recursive splits) as a consequence in order to make the tree Markov, are split. A 
tree node is said to have Markov-split permit if it, and all the other nodes that would be 
split in a Markov-split of the node, have split permits. This notion is best illustrated with 
an example. First we define Markov and Non-Markov states of an AOCT. A state of an 
AOCT is Markov if given any next action-observation pair, the next state is determined; 
otherwise it is labeled as non-Markov. Now in Figure|2A), suppose the current Markov 
AOCT is the tree without dashed edges. Then after splitting the leaf node marked by 



12 



Phuong Nguyen, Peter Sunehag, and Marcus Hutter 




Fig. 2. AOCT proposals 



* (the node associated with state 00101), the state 001 becomes non-Markov so this 
associated node needs to be split. However, after splitting this node (node associated 
with state 001), state becomes non-Markov, hence it needs splitting as well. In short, 
to split the node marked by *, the two nodes associated with states 001 and have to 
be split as well so as to ensure the resulting tree is Markov after splitting. Similarly, a 
tree node has Markov-merge permit if it, and all of the tree nodes that minimally and 
recursively need to be merged after the original node is merged in order to make the tree 
Markov, have merge permits. For example, in Figure |2jB), suppose the current tree is 
the tree including both solid and dashed edges, then the node marked by * has Markov- 
merge permit, if it itself, and the nodes associated with paths 001, 021 and 00101 that 
need to be merged, have merge permits. When a node with Markov-merge permit is 
Markov-merged, it and its Markov-merge-associated nodes are merged. 

Our procedure to generate the next tree from the current tree (draw sample from 
q(y\-)) in the space of Markov AOCTs consists of the following main steps: 

- From the given tree, identify two sets: one is N$ containing nodes with Markov- 
split permits, and the other Nm containing nodes with Markov-merge permits. 

- Suppose that either Ns or Nm is non-empty otherwise the algorithm (GS^A) must 
stop; then if either Ns or Nm is empty, select a node uniformly at random from the 
other set; otherwise select Ns or Nm randomly with probability h each, and after 
that choose a tree node randomly from the selected set. 

- Markov-split the node if it belongs to Ns, otherwise Markov-merge it 

Once we have drawn the new tree y, the Metropolis Hastings correction factor can be 
straightforwardly calculated via the formula 



here Ns and Nm are respectively the set of nodes with Markov-split permits, and the 
set of nodes with Markov-merge permits of y. 



q(y\y) 
q(y\y) 




if y is proposed from y by Markov-splitting 
if y is proposed from y by Markov-merging 
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Sharing. If the stochastic search algorithm utilized is PT, we apply another trick to ef- 
fectively accelerate the search process. Whenever a node is labeled unmergeable, that 
is, by splitting this node the cost function decreases, or in other words a good addi- 
tional relevant short-term memory for predicting rewards is found, the states associated 
with the new nodes created by the splitting are replicated in the trees with the other 
temperatures. 



4 Experiments 

4.1 Experimental Setup 



Parameter 


Component 


Value 


a 


Cost at p 


0.1 


P 


Cost at p 


0.1 


initial Sample Number 


GSiA. 


5000 


agentLearning Loops 


GS0A 


1 


Iterations 


PT 


100 


I 


PT 


10 


Ti,i<I 


PT 


Ti = f3xixlog(n) 


ato 


PT 


0.7 


7 


AVI, Q-Learning 


0.999999 


V 


Q-Learning 


0.01 



Table 2. Parameter setting for the GS&A algorithm 



Below in this section we present our empirical studies of the <PMDP algorithm 
GS<£A described in Section [3] For all of our experiments, stochastic search (PT) is 
applied in the <P space of Markov AOCTs. 

For a variety of tested domains, our algorithm produces consistent results using the 
same set of parameters. These parameters are shown in Table l4.ll and are not fine tuned. 

The results of ^MDP and the three competitors in the four above-listed environ- 
ments are shown in Figuresl3l l4l7l l8land ??. In each of the plots, various time points are 
chosen to assess and compare the quality of the policies learned by the four approaches. 
In order to evaluate how good a learned policy is, at each point, the learning process 
of each agent, and the exploration of the three competitors are temporarily switched 
off. The selected statistic to compare the quality of learning is the averaged reward over 
5000 actions using the current policy. For stability, the statistic is averaged over 10 runs. 

As shown in more detail below, <?MDP is superior to U-tree and active-LZ, and is 
comparable to MC-AIXI-CTW in short-term memory domains. Overall conclusions are 
clear, and we, therefore, omit error bars. 
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4.2 Environments and results 

We describe each environment, the resulting performance, and the tree that was found 
by ^MDP in the cheese maze domain. 

4x4 Grid. The domain is a 4x4 grid world. At each time step, the agent can move one 
cell left, right, up and down within the grid world. The observations are uninformative. 
When the agent enters the bottom-right corner of the grid; it gets a reward of 1, and is 
automatically and randomly sent back to one of the remaining 15 cells. Entering any cell 
other than the bottom-right one gives the agent a zero reward. To achieve the maximal 
total reward, the agent must be able to remember a series of smart actions without any 
clue about its relative position in the grid. 

The context tree found contains 34 states. Some series of actions that take the agent 
towards the bottom-right corner of the grid are present in the context tree. As shown in 
the 4x4-grid plot in Figure [3] after 5000 experiences gathered from the random policy, 
<PMDP finds the optimal policy, and so does MC-AIXI-CTW and U-Tree. Active-LZ, 
however, does not converge to an optimal policy even after 50,000 learning cycles. 

Tiger. The tiger domain is described as follows. There are two doors, left and right; 
an amount of gold and a tiger are placed behind the two doors in a random order. The 
person has three possible actions: listen to predict the position of the tiger, open the right 
door, and open the left door. If the person listens, he has to pay some money (reward of 
-1). The probability that the agent hears correctly is 0.85. If the person opens either of 
the doors and sees the gold, the obtained reward is 10; or otherwise he faces the tiger, 
then the agent receives a reward of -100. After the door is opened, the episode ends; 
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Fig. 3. 4x4 Grid 
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and in the next episode the tiger sits randomly again behind either the left or the right 
door. 
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Our parallel tempering procedure found a context tree consisting of 39 states in- 
cluding some important states where the history is such that the agent has listened a 
few times before opening the door. It can be seen from the tiger plot in Figure [4] that 
the optimal policy ^MDP found after 5,000 learning experiences does yield positive 
reward on average, while from time point 10,000 on, it achieves as high rewards as MC- 
AIXI-CTW. U-Tree appears to learn more slowly but eventually manages to get posi- 
tive averaged rewards after 50,000 cycles like $MDP and MC-AIXI-CTW. Active-LZ 
is performing far worse. The optimal policy that $MDP, MC-AIXI-CTW, and U-Tree 
ultimately found is the following. First listen two times, if the listening outcomes are 
consistent, open the predicted door with gold behind; otherwise take one more listening 
action, and based on the majority to open the appropriate door. 

Cheese Maze. This domain, as shown in Figure|5] consists of a eleven-cell maze with a 
cheese in it. The agent is a mouse that attempts to find the cheese. The agent's starting 
position for each episode is at one of the eleven cells uniformly random. The actions 
available to the agent are: move one cell left (0), right (1), up (2) and down (3). However, 
it should be noticed that if the agent hits the wall, its relative position in the maze 
remains unchanged. At each cell the agent can observe which directions among left, 
right, up and down the cell is blocked by a wall. If wall-blocking statuses of each cell are 
represented by 1 (blocked), and (free) respectively; then an observation is described by 
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Fig. 5. Cheese-maze domain 



a four-digit binary number where the digits from left to right are wall-blocking statuses 
of up, left, down and right directions. For example, 0101 = 5, 01 1 1 = 7, ... as described 
in Figure|5] The agent gets a reward of -1 when moving into a free cell without a cheese; 
hitting the wall gives it a penalty of -10; and a reward of 10 is given to the agent when 
it finds the cheese. As can be seen, some observations themselves alone are insufficient 
for the mouse to locate itself unambiguously in the maze. Hence, the mouse must learn 
to resolve these ambiguities of observations in the maze to be able to find the optimal 
policy. 




Fig. 6. Cheese-maze tree 



Our algorithm found a context tree consisting of 43 states that contains the tree as 
shown in Figure [6] The tree splits from the root into the 6 possible observations. Then 
observations 5 and 10 are split into the four possible actions; and some of these actions, 
the ones that come from a different location and not a wall collision, are split further 
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into the 6 "possible" observations before that. This resolves which 5 or which 10 we are 
at. The states in this tree resolve the most important ambiguities of the raw observations 
and an optimal policy can be found. The domain contains an infinite amount of longer 
dependencies among which our found states pick up a small subset. The cheese-maze 
plot in Figure [7] shows that after the initial 5000 experiences, ^MDP is marginally 
worse than MC-AIXI-CTW but is better than U-Tree and Active-LZ. From time point 
10,000, there is no difference between <2>MDP and MC-AIXI-CTW. U-Tree and Active- 
LZ remain inferior. 
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Fig. 7. Cheese maze 



Kuhn Poker. In Kuhn poker [ 17 1 a deck of only three cards (Jack, Queen and King) is 
used. The agent always plays second in any game (episode). After putting a chip each 
into play, the players are dealt a card each. Then the first player says bet or pass and 
the second player chooses bet or pass. If player one says pass and player two says bet 
then player one must choose again between bet and pass. Whenever a player says bet 
they must put in another chip. If one player bets and the other pass the better gets all the 
chips in play. Otherwise the player with the highest card gets the chips. Player one plays 
according to a fixed but stochastic Nash optimal strategy [11 1. <PMDP finds 89 states. 
It can be observed from the Kunh-poker plot in Figure [8]that <PMDP is comparable to 
MC-AIXI-CTW and much better than U-Tree and Active-LZ, who loose money. 
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5 Conclusions 

Based on the Feature Reinforcement Learning framework 1 14 1 we defined actual prac- 
tical reinforcement learning agents that perform very well empirically. We evaluated a 
reasonably simple instantiation of our algorithm that first takes 5000 random actions 
followed by finding a map through a search procedure and then it performs Q-learning 
on the MDP defined by the map's state set. 

We performed an evaluation on four test domains used to evaluate MC-AIXI-CTW 
in |f28|. Those domains are all suitably attacked with context tree methods. We defined 
a ^MDP agent for a class of maps based on context trees, and compared it to three 
other context tree-based methods. Key to the success of our ^MDP agent was the de- 
velopment of a suitable stochastic search method for the class of Markov AOCTs. We 
combined parallel tempering with a specialized proposal distribution that results in an 
effective stochastic search procedure. The <PMDP agent outperforms both the classical 
U-tree algorithm lETI and the recent Active-LZ algorithm [6|, and is competitive with 
the newest state of the art method MC-AIXI-CTW (28]. The main reason that <£MDP 
outperforms U-tree is that ^MDP uses a global criterion (enabling the use of power- 
ful global optimizers) whereas U-tree uses a local split-merge criterion. ^MDP also 
performs significantly better than Active-LZ. Active-LZ learns slowly as it overesti- 
mates the environment model (assuming n-Markov or complete context-tree environ- 
ment models); and this leads to unreliable value-function estimates. 

Below are some detailed advantages of <PMDP over MC-AIXI-CTW: 



Feature Reinforcement Learning in Practice 19 



- ^MDP is more efficient than MC-AIXI-CTW in both computation and memory 
usage. <PMDP only needs an initial number of samples and then it finds the optimal 
map and uses AVI to find MDP parameters. After this it only needs a Q-learning 
update for each iteration. On the other hand, MC-AIXI-CTW requires model updat- 
ing, planning and value-reverting at every single cycle which together are orders of 
magnitude more expensive than Q-learning. In the experiments ^MDP finished in 
minutes while MC-AIXI-CTW needed hours. Another disadvantage of MC-AIXI- 
CTW is that it is a memory-hungry algorithm. ^MDP learns the best tree repre- 
sentation using stochastic search, which expands a tree towards relevant histories. 
MC-AIXI-CTW learns the mixture of trees where the number of tree nodes grows 
(and thereby the memory usage) linearly with time. 

- ^MDP learns a single state representation and can use many classical RL algo- 
rithms, e.g. Q-Learning, for MDP learning and planning. 

- Another key benefit is that ^MDP represents a more discriminative approach than 
MC-AIXI-CTW since it aims primarily for the ability to predict future rewards and 
not to fully model the observation sequence. If the observation sequence is very 
complex, this becomes essential. 

On the other hand, to be fair it should be noted that compared to <PMDP, MC-AIXI- 
CTW is more principled. The results presented in this paper are encouraging since they 
show that we can achieve comparable results to the more sophisticated MC-AIXI-CTW 
algorithm on problems where only short-term memory is needed. We plan to utilize the 
aforementioned advantages of the ^MDP framework, like flexibility in environment 
modeling and computational efficiency, to attack more complex and larger problems. 
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